Abstract: Cascade classifiers are widely used in real-time object detection. Unlike conventional classifiers, which are designed
for a low overall classification error rate, the classifier in each node of the cascade is required to achieve an extremely high detection
rate and a moderate false positive rate. Although a few reported methods address this requirement in the context of object
detection, there is no principled feature selection method that explicitly takes this asymmetric node learning objective into account.
We provide such an algorithm here. We show that a special case of the biased minimax probability machine has the same formulation
as the linear asymmetric classifier (LAC) of [1]. We then design a new boosting algorithm that directly optimizes the cost function of
LAC. The resulting totally-corrective boosting algorithm is implemented by the column generation technique in convex optimization.
Experimental results on object detection verify the effectiveness of the proposed boosting algorithm as a node classifier in cascade
object detection, and show performance better than that of the current state-of-the-art.

Index Terms — AdaBoost, minimax probability machine, cascade classifier, object detection. 




1 Introduction 

Real-time object detection inherently involves searching
a large number of candidate image regions for
a small number of objects. Processing a single image, for 
example, can require the interrogation of well over a million 
scanned windows in order to uncover a single correct 
detection. This imbalance in the data has an impact on the 
way that detectors are applied, but also on the training 
process. This impact is reflected in the need to identify 
discriminative features from within a large over-complete 
feature set. 

Cascade classifiers have been proposed as a potential 
solution to the problem of imbalance in the data [2], [3],
[4], [5], [6], and have received significant attention due
to their speed and accuracy. In this work, we propose 
a principled method by which to train a boosting-based 
cascade of classifiers. 

The boosting-based cascade approach to object detection 
was introduced by Viola and Jones [6], [7], and has received
significant subsequent attention [8], [9], [10], [11], [12], [13].
It also underpins the current state-of-the-art [1], [2].

The Viola and Jones approach uses a cascade of increas- 
ingly complex classifiers, each of which aims to achieve 
the best possible classification accuracy while achieving an
extremely low false negative rate. These classifiers can be
seen as forming the nodes of a degenerate binary tree (see
Fig. 1), whereby a negative result from any single such node
classifier terminates the interrogation of the current patch.
Viola and Jones use AdaBoost to train each node classifier in 
order to achieve the best possible classification accuracy. A 
low false negative rate is achieved by subsequently adjust- 
ing the decision threshold until the desired false negative 
rate is achieved. This process cannot be guaranteed to 
produce the best classification accuracy for a given false 
negative rate. 

Under the assumption that each node of the cascade
classifier makes independent classification errors, the
detection rate and false positive rate of the entire cascade
are $F_{\rm dr} = \prod_{t=1}^{N} d_t$ and $F_{\rm fp} = \prod_{t=1}^{N} f_t$, respectively, where
$d_t$ represents the detection rate of classifier $t$, $f_t$ the
corresponding false positive rate and $N$ the number of nodes.
As pointed out in [1], [6], these two equations suggest a
node learning objective: each node should have an extremely
high detection rate $d_t$ (e.g., 99.7%) and a moderate false
positive rate $f_t$ (e.g., 50%). With the above values of $d_t$ and
$f_t$, and a cascade of $N = 20$ nodes, $F_{\rm dr} \approx 94\%$ and
$F_{\rm fp} \approx 10^{-6}$, which is a typical design goal.
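
As a quick numerical check of these figures, the following sketch multiplies the per-node rates under the independence assumption stated above. The function name `cascade_rates` is ours and purely illustrative; it is not part of any published implementation.

```python
from math import prod


def cascade_rates(detection_rates, false_positive_rates):
    """Overall detection and false positive rates of a cascade,
    assuming every node makes independent classification errors."""
    return prod(detection_rates), prod(false_positive_rates)


# Node learning objective quoted in the text: d_t = 99.7%, f_t = 50%, N = 20.
N = 20
F_dr, F_fp = cascade_rates([0.997] * N, [0.5] * N)
print(f"F_dr = {F_dr:.4f}, F_fp = {F_fp:.2e}")
```

With these inputs the detection rate comes out at roughly 0.94 and the false positive rate just below 1e-6, matching the design goal quoted above.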

One drawback of the standard AdaBoost approach to 
boosting is that it does not take advantage of the cascade 
classifier's special structure. AdaBoost only minimizes the 
overall classification error and does not minimize the num- 
ber of false negatives. In this sense, the features selected are 
not optimal for the purpose of rejecting as many negative 
examples as possible. Viola and Jones proposed a solution 
to this problem in AsymBoost [7] (and its variants [8], [9],
[14], [15]) by modifying the exponential loss function so
as to more greatly penalize false negatives. AsymBoost 
achieves better detection rates than AdaBoost, but still 
addresses the node learning goal indirectly, and cannot be 
guaranteed to achieve the optimal solution. 

Wu et al. explicitly studied the node learning goal and
proposed to use the linear asymmetric classifier (LAC) and
Fisher linear discriminant analysis (LDA) to adjust the
weights on a set of features selected by AdaBoost or
AsymBoost [1], [2]. Their experiments indicated that with
this post-processing technique the node learning objective 
can be better met, which is translated into improved de- 
tection rates. In Viola and Jones' framework, boosting is 
used to select features and at the same time to train a 
strong classifier. Wu et al.'s work separates these two tasks: 
AdaBoost or AsymBoost is used to select features; and as 
a second step, LAC or LDA is used to construct a strong 
classifier by adjusting the weights of the selected features. 
The node learning objective is only considered at the second 
step. At the first step — feature selection — the node learning 
objective is not explicitly considered at all. We conjecture 
that further improvement may be gained if the node learning 
objective is explicitly taken into account at both steps. We thus 
propose new boosting algorithms to implement this idea 
and verify this conjecture. A preliminary version of this
work was published in Shen et al. [16].
Our major contributions are as follows. 

1) Starting from the theory of minimax probability ma- 
chines (MPMs), we derive a simplified version of 
the biased minimax probability machine, which has 
the same formulation as the linear asymmetric classifier
of [1]. We thus show the underlying connection
between MPM and LAC. Importantly, this new in- 
terpretation weakens some of the restrictions on the 
acceptable input data distribution imposed by LAC. 

2) We develop new boosting-like algorithms by directly 
minimizing the objective function of the linear asym- 
metric classifier, which results in an algorithm that we 
label LACBoost. We also propose FisherBoost on the
basis of Fisher LDA rather than LAC. Both methods
may be used to identify the feature set that optimally 
achieves the node learning goal when training a 
cascade classifier. To our knowledge, this is the first 
attempt to design such a feature selection method. 

3) LACBoost and FisherBoost share similarities with
LPBoost [17] in the sense that both use column
generation, a technique originally proposed for
large-scale linear programming (LP). Typically, the
Lagrange dual problem is solved at each iteration
in column generation. We instead solve the primal
quadratic programming (QP) problem, which has a
special structure that allows entropic gradient (EG)
to solve it very efficiently. Compared
with general interior-point based QP solvers, EG is
much faster.

4) We apply LACBoost and FisherBoost to object
detection and observe better performance than
the state-of-the-art methods [1], [2], [18]. The results
confirm our conjecture and show the effectiveness of 
LACBoost and FisherBoost. These methods can be im- 
mediately applied to other asymmetric classification 
problems. 

Moreover, we analyze the conditions under which LAC is
valid, and show that the multi-exit cascade might
be more suitable for applying the LAC learning of [1], [2] (and
our LACBoost) than the standard Viola-Jones cascade.

Fig. 1: Cascade classifiers. The first one is the standard cascade of Viola
and Jones [6]. The second one is the multi-exit cascade proposed in [9].
Only those classified as true detection by all nodes will be true targets.

As observed in Wu et al. [1], in many cases, LDA even
performs better than LAC. In our experiments, we have
also observed similar phenomena. Paisitkriangkrai et al.
[11] empirically showed that LDA's criterion can be used
to achieve better detection results. An explanation of why 
LDA works so well for object detection is missing in the 
literature. Here we demonstrate that in the context of object 
detection, LDA can be seen as a regularized version of LAC 
in approximation. 

The proposed LACBoost/FisherBoost algorithm differs 
from traditional boosting algorithms in that it does not 
minimize a loss function. This opens new possibilities for 
designing new boosting algorithms for special purposes. 
We have also extended column generation for optimizing 
nonlinear optimization problems. 

1.1 Related Work 

The three components that make up Viola and Jones'
detection approach are:

1) The cascade classifier, which efficiently filters out 
negative patches in early nodes while maintaining a 
very high detection rate; 

2) AdaBoost, which selects informative features and at the
same time trains a strong classifier;

3) The use of integral images, which makes the compu- 
tation of Haar features extremely fast. 

This approach has received significant subsequent attention.
A number of alternative cascades have been developed
including the soft cascade [19], the dynamic cascade
[20], and the multi-exit cascade [9]. In this work we have
adopted the multi-exit cascade that aims to improve
classification performance by using the results of all of the
weak classifiers applied to a patch so far in reaching a
decision at each node of the tree (see Fig. 1). Thus the
$n$-th node classifier uses the results of the weak classifiers
associated with node $n$, but also those associated with the
previous $n - 1$ node classifiers in the cascade. We show
below that LAC post-processing can enhance the multi-exit
cascade, and that the multi-exit cascade more accurately
fulfills the LAC requirement that the margin be drawn from
a Gaussian distribution.
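
The difference between the two cascades can be illustrated with a minimal sketch. It assumes that the multi-exit node score simply accumulates the weighted weak classifier responses of all nodes visited so far; the helper names (`stump`, `multi_exit_cascade`) and the toy weights and thresholds are hypothetical and chosen only for illustration.

```python
def stump(index, threshold):
    """Hypothetical weak classifier: thresholds one input coordinate, returns -1 or +1."""
    return lambda x: 1 if x[index] > threshold else -1


def multi_exit_cascade(x, nodes, weights, thresholds):
    """Evaluate a multi-exit cascade on the patch descriptor x.

    nodes[k]      -- weak classifiers introduced at node k
    weights[k]    -- their coefficients
    thresholds[k] -- decision threshold of node k
    The score is carried over from node to node, so node k decides using
    the weak classifiers of nodes 0..k (a standard cascade would reset it).
    """
    score = 0.0
    for node, w, b in zip(nodes, weights, thresholds):
        score += sum(w_j * h(x) for w_j, h in zip(w, node))
        if score < b:
            return -1  # early exit: the patch is rejected at this node
    return +1  # accepted by every node


nodes = [[stump(0, 0.0)], [stump(1, 0.0), stump(2, 0.0)]]
weights = [[1.0], [0.5, 0.5]]
thresholds = [0.0, 1.0]

accepted = multi_exit_cascade([1.0, 1.0, 1.0], nodes, weights, thresholds)
rejected_early = multi_exit_cascade([-1.0, 1.0, 1.0], nodes, weights, thresholds)
rejected_late = multi_exit_cascade([1.0, -1.0, -1.0], nodes, weights, thresholds)
```

Resetting `score` to zero at the top of each loop iteration would turn this sketch into the standard Viola-Jones cascade, in which each node sees only its own weak classifiers.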

There have also been a number of improvements suggested
to the learning algorithm that Viola and Jones use
for constructing a classifier. Wu et al., for example,
use fast forward feature selection to accelerate the training
procedure [21]. Wu et al. [1] also showed that LAC may
be used to deliver better classification performance. Pham
and Cham recently proposed online asymmetric boosting
that considerably reduces the training time required [8]. By
exploiting the feature statistics, Pham and Cham have also
designed a fast method to train weak classifiers [22]. Li et
al. proposed FloatBoost, which discards redundant weak
classifiers during AdaBoost's greedy selection procedure
[10]. Liu and Shum also proposed KLBoost, aiming to
select features that maximize the projected Kullback-Leibler
divergence and select feature weights by minimizing the
classification error [23]. Promising results have also been
reported by LogitBoost [24] that employs the logistic
regression loss, and GentleBoost [25] that uses adaptive Newton
steps to fit the additive model. Multi-instance boosting has
been introduced to object detection [26], [27], [28], which
does not require exactly labeled locations of the targets in
training data.

New features have also been designed for improving the
detection performance. Viola and Jones' Haar features are
not sufficiently discriminative for detecting more complex
objects like pedestrians, or multi-view faces. Covariance
features [24] and histogram of oriented gradients (HOG)
[29] have been proposed in this context, and efficient
implementation approaches (along the lines of integral images)
have been developed for each. Shape context, which can also
exploit integral images [30], was applied to human detection
in thermal images [31]. The local binary pattern (LBP)
descriptor and its variants have shown promising
performance on human detection [32], [33]. Recently, effort
has been spent on combining complementary features,
including: simple concatenation of HOG and LBP [34],
combination of heterogeneous local features in a boosted
cascade classifier [35], and Bayesian integration of intensity,
depth and motion features in a mixture-of-experts model
[36].



The paper is organized as follows. We briefly review
the concept of the minimax probability machine and derive
the new simplified version of the biased minimax probability
machine in Section 2. Linear asymmetric classification and
its connection to the minimax probability machine are
discussed in Section 3. In Section 4 we show how to design
new boosting algorithms (LACBoost and FisherBoost) by
rewriting the optimization formulations of LAC and Fisher
LDA. The new boosting algorithms are applied to object
detection in Section 5 and we conclude the paper in Section 6.



1.2 Notation 

The following notation is used. A matrix is denoted by a
bold upper-case letter ($\mathbf{X}$); a column vector is denoted by a
bold lower-case letter ($\mathbf{x}$). The $i$th row of $\mathbf{X}$ is denoted by
$\mathbf{X}_{i:}$ and the $i$th column by $\mathbf{X}_{:i}$. The identity matrix is $\mathbf{I}$ and its
size should be clear from the context. $\mathbf{1}$ and $\mathbf{0}$ are column
vectors of 1's and 0's, respectively. We use $\succcurlyeq$, $\preccurlyeq$ to denote
component-wise inequalities.

Let $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1,\dots,m}$ be the set of training data,
where $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \{-1, +1\}$, $\forall i$. The training set
consists of $m_1$ positive training points and $m_2$ negative
ones; $m_1 + m_2 = m$. Let $h(\cdot) \in \mathcal{H}$ be a weak classifier that
projects an input vector $\mathbf{x}$ into $\{-1, +1\}$. Note that here
we consider only classifiers with discrete outputs although
the developed methods can be applied to real-valued weak
classifiers too. We assume that $\mathcal{H}$, the set from which $h(\cdot)$
is selected, is finite and has $n$ elements.

Define the matrix $\mathbf{H}^{\mathcal{Z}} \in \mathbb{R}^{m \times n}$ such that the $(i, j)$ entry
$\mathbf{H}^{\mathcal{Z}}_{ij} = h_j(\mathbf{x}_i)$ is the label predicted by weak classifier $h_j(\cdot)$
for the datum $\mathbf{x}_i$, where $\mathbf{x}_i$ is the $i$th element of the set $\mathcal{Z}$. In
order to simplify the notation we eliminate the superscript
when $\mathcal{Z}$ is the training set, so $\mathbf{H}^{\mathcal{Z}} = \mathbf{H}$. Therefore, each
column $\mathbf{H}_{:j}$ of the matrix $\mathbf{H}$ consists of the output of weak
classifier $h_j(\cdot)$ on all the training data, while each row $\mathbf{H}_{i:}$
contains the outputs of all weak classifiers on the training
datum $\mathbf{x}_i$. Define similarly the matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ such
that $\mathbf{A}_{ij} = y_i h_j(\mathbf{x}_i)$. Note that boosting algorithms depend
entirely on the matrix $\mathbf{A}$ and do not directly interact with
the training examples. Our following discussion will thus
largely focus on the matrix $\mathbf{A}$. We write the vector obtained
by multiplying a matrix $\mathbf{A}$ with a vector $\mathbf{w}$ as $\mathbf{A}\mathbf{w}$ and its
$i$th entry as $(\mathbf{A}\mathbf{w})_i$. If we let $\mathbf{w}$ represent the coefficients of
the selected weak classifiers then the margin of the training
datum $\mathbf{x}_i$ is $\rho_i = \mathbf{A}_{i:}\mathbf{w} = (\mathbf{A}\mathbf{w})_i$ and the vector of such
margins for all of the training data is $\boldsymbol{\rho} = \mathbf{A}\mathbf{w}$.
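
To make the notation concrete, the following sketch builds $\mathbf{H}$, $\mathbf{A}$ and the margin vector $\boldsymbol{\rho}$ for a toy problem. The four training points, the three decision stumps and the weights are all made-up values for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Toy training set: m = 4 points in R^2, two positives followed by two negatives.
X = np.array([[0.2, 1.5], [1.1, 0.3], [-0.7, -0.2], [-1.3, 0.8]])
y = np.array([+1, +1, -1, -1])

# Hypothetical finite pool of n = 3 weak classifiers with outputs in {-1, +1}.
weak_classifiers = [
    lambda x: 1 if x[0] > 0.0 else -1,
    lambda x: 1 if x[1] > 0.5 else -1,
    lambda x: 1 if x[0] + x[1] > 0.0 else -1,
]

# H[i, j] = h_j(x_i): weak classifier outputs on the training data.
H = np.array([[h(x) for h in weak_classifiers] for x in X])

# A[i, j] = y_i * h_j(x_i): +1 where classifier j is right on datum i, -1 otherwise.
A = y[:, None] * H

# Margins rho = A w for a given nonnegative weight vector w.
w = np.array([0.5, 0.2, 0.3])
rho = A @ w
print(rho)
```

A positive margin $\rho_i$ means that the weighted vote $\mathrm{sign}(\mathbf{H}_{i:}\mathbf{w})$ agrees with the label $y_i$, which is why the optimization problems later in the paper can be written in terms of $\mathbf{A}$ alone.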

2 Minimax Probability Machines 

Before we introduce our boosting algorithm, let us briefly 
review the concept of minimax probability machines 
(MPM) [37] first.

2.1 Minimax Probability Classifiers 

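Let $\mathbf{x}_1 \in \mathbb{R}^n$ and $\mathbf{x}_2 \in \mathbb{R}^n$ denote two random vectors
drawn from two distributions with means and covariances
$(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$, respectively. Here $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2 \in \mathbb{R}^n$
and $\boldsymbol{\Sigma}_1, \boldsymbol{\Sigma}_2 \in \mathbb{R}^{n \times n}$. We define the class labels of $\mathbf{x}_1$
and $\mathbf{x}_2$ as $+1$ and $-1$, w.l.o.g. The minimax probability
machine (MPM) seeks a robust separation hyperplane that
can separate the two classes of data with the maximal
probability. The hyperplane can be expressed as $\mathbf{w}^\top \mathbf{x} = b$
with $\mathbf{w} \in \mathbb{R}^n \setminus \{\mathbf{0}\}$ and $b \in \mathbb{R}$. The problem of identifying
the optimal hyperplane may then be formulated as

$$
\max_{\mathbf{w}, b, \gamma} \; \gamma \quad \text{s.t.} \quad
\Big[ \inf_{\mathbf{x}_1 \sim (\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)} \Pr\{\mathbf{w}^\top \mathbf{x}_1 \ge b\} \Big] \ge \gamma, \quad
\Big[ \inf_{\mathbf{x}_2 \sim (\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)} \Pr\{\mathbf{w}^\top \mathbf{x}_2 \le b\} \Big] \ge \gamma.
\tag{1}
$$

Here $\gamma$ is the lower bound of the classification accuracy
(or the worst-case accuracy) on test data. This problem can
be transformed into a convex problem, more specifically a
second-order cone program (SOCP) [38], and thus can be
solved efficiently [37].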

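Before we present our results, we introduce an important
proposition from [39]. Note that we have used different
notation.

Proposition 2.1. For a few different distribution families, the
worst-case constraint

$$
\Big[ \inf_{\mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})} \Pr\{\mathbf{w}^\top \mathbf{x} \le b\} \Big] \ge \gamma
\tag{2}
$$

can be written as:

1) if $\mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})$, i.e., $\mathbf{x}$ follows an arbitrary distribution
with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, then

$$
b \ge \mathbf{w}^\top \boldsymbol{\mu} + \sqrt{\frac{\gamma}{1 - \gamma}} \cdot \sqrt{\mathbf{w}^\top \boldsymbol{\Sigma} \mathbf{w}};
\tag{3}
$$

2) if $\mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})_{\rm S}$ (see footnote 1), then we have

$$
\begin{cases}
b \ge \mathbf{w}^\top \boldsymbol{\mu} + \sqrt{\dfrac{1}{2(1 - \gamma)}} \cdot \sqrt{\mathbf{w}^\top \boldsymbol{\Sigma} \mathbf{w}}, & \text{if } \gamma \in (0.5, 1); \\
b \ge \mathbf{w}^\top \boldsymbol{\mu}, & \text{if } \gamma \in (0, 0.5];
\end{cases}
\tag{4}
$$

3) if $\mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})_{\rm SU}$, then

$$
\begin{cases}
b \ge \mathbf{w}^\top \boldsymbol{\mu} + \dfrac{2}{3} \sqrt{\dfrac{1}{2(1 - \gamma)}} \cdot \sqrt{\mathbf{w}^\top \boldsymbol{\Sigma} \mathbf{w}}, & \text{if } \gamma \in (0.5, 1); \\
b \ge \mathbf{w}^\top \boldsymbol{\mu}, & \text{if } \gamma \in (0, 0.5];
\end{cases}
\tag{5}
$$

4) if $\mathbf{x}$ follows a Gaussian distribution with mean $\boldsymbol{\mu}$ and
covariance $\boldsymbol{\Sigma}$, i.e., $\mathbf{x} \sim \mathcal{G}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then

$$
b \ge \mathbf{w}^\top \boldsymbol{\mu} + \Phi^{-1}(\gamma) \cdot \sqrt{\mathbf{w}^\top \boldsymbol{\Sigma} \mathbf{w}},
\tag{6}
$$

where $\Phi(\cdot)$ is the cumulative distribution function (c.d.f.)
of the standard normal distribution $\mathcal{G}(0, 1)$, and $\Phi^{-1}(\cdot)$
is the inverse function of $\Phi(\cdot)$.

Two useful observations about $\Phi^{-1}(\cdot)$ are: $\Phi^{-1}(0.5) = 0$;
and $\Phi^{-1}(\cdot)$ is a monotonically increasing function in its
domain.

We omit the proof here and refer the reader to [39] for
details.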



2.2 Biased Minimax Probability Machines 

The formulation (1) assumes that the classification problem
is balanced. It attempts to achieve a high recognition accu- 
racy, which assumes that the losses associated with all mis- 
classifications are identical. However, in many applications 
this is not the case. 



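1. Here $(\boldsymbol{\mu}, \boldsymbol{\Sigma})_{\rm S}$ denotes the family of distributions in $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ that
are also symmetric about the mean $\boldsymbol{\mu}$. $(\boldsymbol{\mu}, \boldsymbol{\Sigma})_{\rm SU}$ denotes the family
of distributions in $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ that are additionally symmetric and linear
unimodal about $\boldsymbol{\mu}$.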



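Huang et al. [40] proposed a biased version of MPM
through a slight modification of (1), which may be
formulated as

$$
\max_{\mathbf{w}, b, \gamma} \; \gamma \quad \text{s.t.} \quad
\Big[ \inf_{\mathbf{x}_1 \sim (\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)} \Pr\{\mathbf{w}^\top \mathbf{x}_1 \ge b\} \Big] \ge \gamma, \quad
\Big[ \inf_{\mathbf{x}_2 \sim (\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)} \Pr\{\mathbf{w}^\top \mathbf{x}_2 \le b\} \Big] \ge \gamma_0.
\tag{7}
$$

Here $\gamma_0 \in (0, 1)$ is a prescribed constant, which is the
acceptable classification accuracy for the less important
class. The resulting decision hyperplane prioritizes the
classification of the important class $\mathbf{x}_1$ over that of the
less important class $\mathbf{x}_2$. Biased MPM is thus expected to
perform better in biased classification applications.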

Huang et al. showed that (7) can be iteratively solved
via solving a sequence of SOCPs using the fractional
programming (FP) technique. Clearly it is significantly more
computationally demanding to solve (7) than (1).

In this paper we are interested in the special case of
$\gamma_0 = 0.5$ due to its important application in cascade object
detection [1], [6]. In the following discussion, for simplicity,
we only consider $\gamma_0 = 0.5$ although some algorithms
developed may also apply to $\gamma_0 < 0.5$.

Next we show how to re-formulate (7) into a simpler
quadratic program (QP) based on the recent theoretical
results in [39].

2.3 Simplified Biased Minimax Probability Machines 

Equation (3) represents the most general of the four cases
presented in equations (3) through (6), and is used in MPM
[37] and the biased MPM [40] because it does not impose
constraints upon the distributions of $\mathbf{x}_1$ and $\mathbf{x}_2$. On the
other hand, one may take advantage of prior knowledge
whenever available. For example, it is shown in [1] that
in face detection, the weak classifier outputs can be well
approximated by Gaussian distributions. Equation (3) does
not utilize any a priori information of this type, and hence,
for many problems, (3) is too conservative.

Let us consider the special case of $\gamma = 0.5$. It is easy to see
that the worst-case constraint (2) becomes a simple linear
constraint for symmetric, symmetric unimodal, as well as
Gaussian distributions. As pointed out in [39], such a result
is the immediate consequence of symmetry because the
worst-case distributions are forced to put probability mass
arbitrarily far away on both sides of the mean. In such a
case any information about the covariance is neglected.

We now apply this result to the biased MPM as represented
by (7). Our main result is the following theorem.

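Theorem 2.1. With $\gamma_0 = 0.5$, the biased minimax problem (7)
can be formulated as an unconstrained problem (16) under the
assumption that $\mathbf{x}_2$ follows a symmetric distribution. The
worst-case classification accuracy for the first class, $\gamma^\star$, is obtained by
solving

$$
\varphi(\gamma^\star) = \frac{-b^\star + {\mathbf{w}^\star}^\top \boldsymbol{\mu}_1}{\sqrt{{\mathbf{w}^\star}^\top \boldsymbol{\Sigma}_1 \mathbf{w}^\star}},
\tag{8}
$$

where $\varphi(\cdot)$ is defined in (11); $\{\mathbf{w}^\star, b^\star\}$ is the optimal solution
of (15) and (16).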
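Proof: The second constraint of (7) is simply

$$
b \ge \mathbf{w}^\top \boldsymbol{\mu}_2.
\tag{9}
$$

The first constraint of (7) can be handled by writing
$\mathbf{w}^\top \mathbf{x}_1 \ge b$ as $-\mathbf{w}^\top \mathbf{x}_1 \le -b$ and applying the results in
Proposition 2.1. It can be written as

$$
-b + \mathbf{w}^\top \boldsymbol{\mu}_1 \ge \varphi(\gamma) \sqrt{\mathbf{w}^\top \boldsymbol{\Sigma}_1 \mathbf{w}},
\tag{10}
$$

with

$$
\varphi(\gamma) =
\begin{cases}
\sqrt{\dfrac{\gamma}{1 - \gamma}} & \text{if } \mathbf{x}_1 \sim (\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), \\
\sqrt{\dfrac{1}{2(1 - \gamma)}} & \text{if } \mathbf{x}_1 \sim (\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)_{\rm S}, \\
\dfrac{2}{3} \sqrt{\dfrac{1}{2(1 - \gamma)}} & \text{if } \mathbf{x}_1 \sim (\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)_{\rm SU}, \\
\Phi^{-1}(\gamma) & \text{if } \mathbf{x}_1 \sim \mathcal{G}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1).
\end{cases}
\tag{11}
$$

Let us assume that $\boldsymbol{\Sigma}_1$ is strictly positive definite (if it
is only positive semidefinite, we can always add a small
regularization to its diagonal components). From (10) we
have

$$
\varphi(\gamma) \le \frac{-b + \mathbf{w}^\top \boldsymbol{\mu}_1}{\sqrt{\mathbf{w}^\top \boldsymbol{\Sigma}_1 \mathbf{w}}}.
\tag{12}
$$

So the optimization problem becomes

$$
\max_{\mathbf{w}, b, \gamma} \; \gamma, \quad \text{s.t. (9) and (12)}.
\tag{13}
$$

The maximum value of $\gamma$ (which we label $\gamma^\star$) is achieved
when (12) is strictly an equality. To illustrate this point, let
us assume that the maximum is achieved when

$$
\varphi(\gamma^\star) < \frac{-b + \mathbf{w}^\top \boldsymbol{\mu}_1}{\sqrt{\mathbf{w}^\top \boldsymbol{\Sigma}_1 \mathbf{w}}}.
$$

Then a new solution can be obtained by increasing $\gamma^\star$ with
a positive value such that (12) becomes an equality. Notice
that the constraint (9) will not be affected, and the new
solution will be better than the previous one. Hence, at the
optimum, (8) must be fulfilled.

Because $\varphi(\gamma)$ is monotonically increasing for all the four
cases in its domain (0, 1) (see Fig. 2), maximizing $\gamma$ is
equivalent to maximizing $\varphi(\gamma)$ and this results in

$$
\max_{\mathbf{w}, b} \; \frac{-b + \mathbf{w}^\top \boldsymbol{\mu}_1}{\sqrt{\mathbf{w}^\top \boldsymbol{\Sigma}_1 \mathbf{w}}}, \quad \text{s.t. } b \ge \mathbf{w}^\top \boldsymbol{\mu}_2.
\tag{14}
$$

As in [37], [40], we also have a scale ambiguity: if $(\mathbf{w}^\star, b^\star)$
is a solution, $(t\mathbf{w}^\star, t b^\star)$ with $t > 0$ is also a solution.

An important observation is that the problem (14) must
attain the optimum at

$$
b = \mathbf{w}^\top \boldsymbol{\mu}_2.
\tag{15}
$$

Otherwise if $b > \mathbf{w}^\top \boldsymbol{\mu}_2$, the optimal value of (14) must be
smaller. So we can rewrite (14) as an unconstrained problem

$$
\max_{\mathbf{w}} \; \frac{\mathbf{w}^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)}{\sqrt{\mathbf{w}^\top \boldsymbol{\Sigma}_1 \mathbf{w}}}.
\tag{16}
$$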

We have thus shown that, if $\mathbf{x}_1$ is distributed according
to a symmetric, symmetric unimodal, or Gaussian distribution,
the resulting optimization problem is identical. This is
not surprising considering the latter two cases are merely
special cases of the symmetric distribution family.

Fig. 2: The function $\varphi(\cdot)$ in (11). The four curves correspond to the four
cases. They are all monotonically increasing in (0, 1).

At optimality the inequality (12) becomes an equation,
and hence $\gamma^\star$ can be obtained as in (8). For ease of
exposition, let us denote the four cases on the right side of (11) as
$\varphi_{\rm gnrl}(\cdot)$, $\varphi_{\rm S}(\cdot)$, $\varphi_{\rm SU}(\cdot)$, and $\varphi_{\mathcal{G}}(\cdot)$. For $\gamma \in [0.5, 1)$, as shown
in Fig. 2, we have $\varphi_{\rm gnrl}(\gamma) > \varphi_{\rm S}(\gamma) > \varphi_{\rm SU}(\gamma) > \varphi_{\mathcal{G}}(\gamma)$.
Therefore, when solving (8) for $\gamma^\star$, we have
$\gamma^\star_{\rm gnrl} < \gamma^\star_{\rm S} < \gamma^\star_{\rm SU} < \gamma^\star_{\mathcal{G}}$.
That is to say, one can get better accuracy
when additional information about the data distribution is
available, although the actual optimization problem to be
solved is identical. □
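
The ordering of the four worst-case accuracies can be checked numerically. The sketch below assumes the four branches of $\varphi(\cdot)$ as we read them from (11), inverts each in closed form, and evaluates them for an arbitrary value of the right-hand side of (8); the value 2.0 and all function names are illustrative only. The inversion is valid when that value exceeds 1, so that every family yields an accuracy above 0.5.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()
FAMILIES = ("general", "symmetric", "symmetric_unimodal", "gaussian")


def phi(gamma, family):
    """The four branches of phi(gamma), for gamma in (0.5, 1)."""
    if family == "general":
        return sqrt(gamma / (1.0 - gamma))
    if family == "symmetric":
        return sqrt(1.0 / (2.0 * (1.0 - gamma)))
    if family == "symmetric_unimodal":
        return (2.0 / 3.0) * sqrt(1.0 / (2.0 * (1.0 - gamma)))
    if family == "gaussian":
        return _STD_NORMAL.inv_cdf(gamma)
    raise ValueError(f"unknown family: {family}")


def gamma_star(kappa, family):
    """Solve phi(gamma) = kappa for gamma, assuming kappa > 1."""
    if family == "general":
        return kappa ** 2 / (1.0 + kappa ** 2)
    if family == "symmetric":
        return 1.0 - 1.0 / (2.0 * kappa ** 2)
    if family == "symmetric_unimodal":
        return 1.0 - 2.0 / (9.0 * kappa ** 2)
    if family == "gaussian":
        return _STD_NORMAL.cdf(kappa)
    raise ValueError(f"unknown family: {family}")


# kappa stands for the normalized distance on the right-hand side of (8).
kappa = 2.0
bounds = {family: gamma_star(kappa, family) for family in FAMILIES}
for family, value in bounds.items():
    print(f"{family:>20s}: gamma* = {value:.4f}")
```

For the same optimal hyperplane, the guaranteed accuracy on the positive class grows as the distributional assumption becomes stronger, which is the point made at the end of the proof.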

We have derived the biased MPM algorithm from a
different perspective. We reveal that only the assumption
of symmetric distributions is needed to arrive at a simple
unconstrained formulation. Compared with the approach in [40],
we have used more information to simplify the optimization
problem. More importantly, as we will show in the next
section, this unconstrained formulation enables us to design
a new boosting algorithm.

There is a close connection between our algorithm and
the linear asymmetric classifier (LAC) in [1]. The resulting
problem (16) is exactly the same as LAC in [1]. Removing
the inequality in this constraint leads to a problem solvable
by eigen-decomposition. We have thus shown that the
results of Wu et al. may be generalized from the Gaussian
distributions assumed in [1] to symmetric distributions.

It is straightforward to kernelize the linear classifier that 
we have discussed, following the work of [37], [40]. Here
we are more interested, however, in designing a boosting 
algorithm that takes the biased learning goal into consid- 
eration when selecting features. 

3 Linear Asymmetric Classification 

We have shown that starting from the biased minimax
probability machine, we are able to obtain the same
optimization formulation as shown in [1], while much weakening
the underlying assumption (symmetric distributions versus
Gaussian distributions). Before we propose our LACBoost
and FisherBoost, however, we provide a brief overview of
LAC.

Wu et al. [2] proposed linear asymmetric classification
(LAC) as a post-processing step for training nodes in the
cascade framework. In [2], it is stated that LAC is
guaranteed to reach an optimal solution under the assumption
of Gaussian data distributions. We now know that this
Gaussianity condition may be relaxed.

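Suppose that we have a linear classifier $f(\mathbf{x}) =
\mathrm{sign}(\mathbf{w}^\top \mathbf{x} - b)$. We seek a $\{\mathbf{w}, b\}$ pair with a very high
accuracy on the positive data $\mathbf{x}_1$ and a moderate accuracy
on the negative data $\mathbf{x}_2$. This can be expressed as the following
problem:

$$
\max_{\mathbf{w} \ne \mathbf{0}, b} \; \Pr_{\mathbf{x}_1 \sim (\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)} \{\mathbf{w}^\top \mathbf{x}_1 \ge b\}, \quad
\text{s.t.} \; \Pr_{\mathbf{x}_2 \sim (\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)} \{\mathbf{w}^\top \mathbf{x}_2 \le b\} = \lambda.
\tag{17}
$$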



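In [1], $\lambda$ is set to 0.5 and it is assumed that for any $\mathbf{w}$,
$\mathbf{w}^\top \mathbf{x}_1$ is Gaussian and $\mathbf{w}^\top \mathbf{x}_2$ is symmetric, so that (17) can be
approximated by (16). Again, these assumptions may be
relaxed as we have shown in the last section. (16) is similar
to LDA's optimization problem

$$
\max_{\mathbf{w} \ne \mathbf{0}} \; \frac{\mathbf{w}^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)}{\sqrt{\mathbf{w}^\top (\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2) \mathbf{w}}}.
\tag{18}
$$

(16) can be solved by eigen-decomposition and a
closed-form solution can be derived:

$$
\mathbf{w}^\star = \boldsymbol{\Sigma}_1^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad b^\star = {\mathbf{w}^\star}^\top \boldsymbol{\mu}_2.
\tag{19}
$$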



On the other hand, each node in cascaded boosting
classifiers has the following form:

$$
f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \mathbf{H}(\mathbf{x}) - b).
\tag{20}
$$



We overload the symbol $\mathbf{H}(\mathbf{x})$ here, which denotes the
output vector of all weak classifiers over the datum $\mathbf{x}$. We can
cast each node as a linear classifier over the feature space
constructed by the binary outputs of all weak classifiers.
For each node in a cascade classifier, we wish to maximize
the detection rate while maintaining the false positive rate
at a moderate level (for example, around 50.0%). That
is to say, the problem (16) represents the node learning
goal. Boosting algorithms such as AdaBoost can be used as
feature selection methods, and LAC used to learn a linear
classifier over those binary features chosen by boosting.
The advantage of this approach is that LAC considers the
asymmetric node learning explicitly.
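
This post-processing step can be sketched as follows, under the assumption that the closed-form solution of (16) is $\mathbf{w}^\star = \boldsymbol{\Sigma}_1^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ with $b^\star = {\mathbf{w}^\star}^\top \boldsymbol{\mu}_2$. The weak classifier outputs are synthetic random $\pm 1$ values, the small ridge term added to $\boldsymbol{\Sigma}_1$ is our own safeguard against a singular covariance estimate, and the names `lac_fit` and `lac_objective` are hypothetical.

```python
import numpy as np


def lac_fit(H_pos, H_neg, ridge=1e-6):
    """Fit LAC weights on weak classifier outputs.

    H_pos, H_neg -- arrays of shape (m1, n) and (m2, n) holding the outputs of
                    the n selected weak classifiers on positives and negatives.
    Returns (w, b, stats), where stats = (mu1, mu2, sigma1).
    """
    mu1, mu2 = H_pos.mean(axis=0), H_neg.mean(axis=0)
    # Covariance of the positive class only; the ridge keeps it invertible.
    sigma1 = np.cov(H_pos, rowvar=False) + ridge * np.eye(H_pos.shape[1])
    w = np.linalg.solve(sigma1, mu1 - mu2)
    b = float(w @ mu2)  # threshold placed at the projected negative mean
    return w, b, (mu1, mu2, sigma1)


def lac_objective(w, mu1, mu2, sigma1):
    """Value of the LAC criterion w^T (mu1 - mu2) / sqrt(w^T Sigma1 w)."""
    return float(w @ (mu1 - mu2) / np.sqrt(w @ sigma1 @ w))


# Synthetic outputs of n = 3 weak classifiers (values in {-1, +1}).
rng = np.random.default_rng(0)
H_pos = np.where(rng.random((200, 3)) < 0.80, 1.0, -1.0)
H_neg = np.where(rng.random((300, 3)) < 0.45, 1.0, -1.0)

w, b, (mu1, mu2, sigma1) = lac_fit(H_pos, H_neg)
best = lac_objective(w, mu1, mu2, sigma1)
print("w =", w, "b =", b, "objective =", best)
```

By the Cauchy-Schwarz inequality no other direction attains a larger value of the criterion for the same (regularized) $\boldsymbol{\Sigma}_1$, which is what makes the eigen-decomposition route unnecessary in this special case.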

However, there is a precondition on the validity of LAC:
for any $\mathbf{w}$, $\mathbf{w}^\top \mathbf{x}_1$ is Gaussian and $\mathbf{w}^\top \mathbf{x}_2$ is symmetric.
In the case of boosting classifiers, $\mathbf{w}^\top \mathbf{x}_1$ and $\mathbf{w}^\top \mathbf{x}_2$ can be
expressed as the margin of positive data and negative data,
respectively. Empirically, Wu et al. [2] verified that $\mathbf{w}^\top \mathbf{x}$
is approximately Gaussian for a cascade face detector. We
discuss this issue in more detail in Section 5. Shen and Li
[41] theoretically proved that under the assumption that
weak classifiers are independent, the margin of AdaBoost
follows the Gaussian distribution, as long as the number of
weak classifiers is sufficiently large. In Section 5 we verify
this theoretical result by performing the normality test on
nodes with different numbers of weak classifiers.



4 Constructing Boosting Algorithms from LDA and LAC

In kernel methods, the original data are nonlinearly
mapped to a feature space by a mapping function $\Phi(\cdot)$.
The function need not be known, however, as rather than
being applied to the data directly, it acts instead through the
inner product $\Phi(\mathbf{x}_i)^\top \Phi(\mathbf{x}_j)$. In boosting [42], however, the
mapping function can be seen as being explicitly known,
as $\Phi(\mathbf{x}) : \mathbf{x} \mapsto [h_1(\mathbf{x}), \dots, h_n(\mathbf{x})]$. Let us consider the
Fisher LDA case first because the solution to LDA will
generalize to LAC straightforwardly, by looking at the
similarity between (16) and (18).

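Fisher LDA maximizes the between-class variance and
minimizes the within-class variance. In the binary-class
case, the more general formulation in (18) can be expressed
as

$$
\max_{\mathbf{w}} \; \frac{(\mu_1 - \mu_2)^2}{\sigma_1 + \sigma_2} = \frac{\mathbf{w}^\top \mathbf{C}_b \mathbf{w}}{\mathbf{w}^\top \mathbf{C}_w \mathbf{w}},
\tag{21}
$$

where $\mathbf{C}_b$ and $\mathbf{C}_w$ are the between-class and within-class
scatter matrices; $\mu_1$ and $\mu_2$ are the projected centers of
the two classes. The above problem can be equivalently
reformulated as

$$
\min_{\mathbf{w}} \; \mathbf{w}^\top \mathbf{C}_w \mathbf{w} - \theta (\mu_1 - \mu_2)
\tag{22}
$$

for some certain constant $\theta$ and under the assumption that
$\mu_1 - \mu_2 \ge 0$ (see footnote 2). Now in the feature space, our data are $\Phi(\mathbf{x}_i)$,
$i = 1, \dots, m$. Define the vectors $\mathbf{e}, \mathbf{e}_1, \mathbf{e}_2 \in \mathbb{R}^m$ such that
$\mathbf{e} = \mathbf{e}_1 + \mathbf{e}_2$, the $i$-th entry of $\mathbf{e}_1$ is $1/m_1$ if $y_i = +1$ and
0 otherwise, and the $i$-th entry of $\mathbf{e}_2$ is $1/m_2$ if $y_i = -1$ and
0 otherwise. We then see that

$$
\mu_1 = \frac{1}{m_1} \mathbf{w}^\top \sum_{y_i = 1} \Phi(\mathbf{x}_i)
= \frac{1}{m_1} \sum_{y_i = 1} \mathbf{A}_{i:} \mathbf{w}
= \frac{1}{m_1} \sum_{y_i = 1} (\mathbf{A}\mathbf{w})_i
= \mathbf{e}_1^\top \mathbf{A} \mathbf{w},
\tag{23}
$$

and

$$
\mu_2 = \frac{1}{m_2} \mathbf{w}^\top \sum_{y_i = -1} \Phi(\mathbf{x}_i)
= \frac{1}{m_2} \sum_{y_i = -1} \mathbf{H}_{i:} \mathbf{w}
= -\mathbf{e}_2^\top \mathbf{A} \mathbf{w}.
\tag{24}
$$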

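2. In our face detection experiment, we found that this assumption
could always be satisfied.

For ease of exposition we order the training data according
to their labels so the vector $\mathbf{e} \in \mathbb{R}^m$ is

$$
\mathbf{e} = [1/m_1, \cdots, 1/m_2, \cdots]^\top,
\tag{25}
$$

and the first $m_1$ components of $\boldsymbol{\rho}$ correspond to the positive
training data and the remaining ones correspond to the $m_2$
negative data. We now see that $\mu_1 - \mu_2 = \mathbf{e}^\top \boldsymbol{\rho}$ and
$\mathbf{C}_w = m_1/m \cdot \boldsymbol{\Sigma}_1 + m_2/m \cdot \boldsymbol{\Sigma}_2$, with $\boldsymbol{\Sigma}_{1,2}$ the covariance matrices.
Noting that

$$
\mathbf{w}^\top \boldsymbol{\Sigma}_{1,2} \mathbf{w} = \frac{1}{m_{1,2}(m_{1,2} - 1)} \sum_{i > k,\; y_i = y_k = \pm 1} (\rho_i - \rho_k)^2,
$$

we can easily rewrite the original problem (21) (and (22))
into:

$$
\min_{\mathbf{w}, \boldsymbol{\rho}} \; \tfrac{1}{2} \boldsymbol{\rho}^\top \mathbf{Q} \boldsymbol{\rho} - \theta \mathbf{e}^\top \boldsymbol{\rho}, \quad
\text{s.t. } \mathbf{w} \succcurlyeq \mathbf{0}, \; \mathbf{1}^\top \mathbf{w} = 1, \; \rho_i = (\mathbf{A}\mathbf{w})_i, \; i = 1, \cdots, m.
\tag{26}
$$

Here $\mathbf{Q} = \begin{bmatrix} \mathbf{Q}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{Q}_2 \end{bmatrix}$ is a block matrix with

$$
\mathbf{Q}_1 =
\begin{bmatrix}
\frac{1}{m} & -\frac{1}{m(m_1 - 1)} & \cdots & -\frac{1}{m(m_1 - 1)} \\
-\frac{1}{m(m_1 - 1)} & \frac{1}{m} & \cdots & -\frac{1}{m(m_1 - 1)} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{m(m_1 - 1)} & -\frac{1}{m(m_1 - 1)} & \cdots & \frac{1}{m}
\end{bmatrix},
$$

and $\mathbf{Q}_2$ is similarly defined by replacing $m_1$ with $m_2$ in
$\mathbf{Q}_1$:

$$
\mathbf{Q}_2 =
\begin{bmatrix}
\frac{1}{m} & -\frac{1}{m(m_2 - 1)} & \cdots & -\frac{1}{m(m_2 - 1)} \\
-\frac{1}{m(m_2 - 1)} & \frac{1}{m} & \cdots & -\frac{1}{m(m_2 - 1)} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{m(m_2 - 1)} & -\frac{1}{m(m_2 - 1)} & \cdots & \frac{1}{m}
\end{bmatrix}.
$$

Also note that we have introduced a constant $\frac{1}{2}$ before
the quadratic term for convenience. The normalization
constraint $\mathbf{1}^\top \mathbf{w} = 1$ removes the scale ambiguity of $\mathbf{w}$.
Without it the problem is ill-posed.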
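
The role of $\mathbf{Q}$ can be verified numerically. The sketch below assumes the block structure described above (diagonal entries $1/m$, off-diagonal entries $-1/(m(m_k - 1))$ within each class block) and checks on random margins that $\boldsymbol{\rho}^\top \mathbf{Q} \boldsymbol{\rho}$ equals the class-weighted sum of the unbiased margin variances, i.e. $\mathbf{w}^\top \mathbf{C}_w \mathbf{w}$. The function name `build_Q` and the `lac` flag are ours.

```python
import numpy as np


def build_Q(m1, m2, lac=False):
    """Block-diagonal matrix Q for m1 positives followed by m2 negatives.

    With lac=True the negative block is left at zero, so that only the
    variance of the positive margins is penalized.
    Requires m1 >= 2 and m2 >= 2.
    """
    m = m1 + m2

    def block(m_k):
        B = np.full((m_k, m_k), -1.0 / (m * (m_k - 1)))
        np.fill_diagonal(B, 1.0 / m)
        return B

    Q = np.zeros((m, m))
    Q[:m1, :m1] = block(m1)
    if not lac:
        Q[m1:, m1:] = block(m2)
    return Q


rng = np.random.default_rng(0)
m1, m2 = 5, 7
m = m1 + m2
rho = rng.normal(size=m)  # stand-in for the margins A w

var_pos = np.var(rho[:m1], ddof=1)  # unbiased variance of positive margins
var_neg = np.var(rho[m1:], ddof=1)  # unbiased variance of negative margins

quad_lda = rho @ build_Q(m1, m2) @ rho
quad_lac = rho @ build_Q(m1, m2, lac=True) @ rho
print(quad_lda, (m1 / m) * var_pos + (m2 / m) * var_neg)
```

The two printed numbers agree up to rounding, so minimizing the quadratic term in (26) is the same as minimizing the within-class variance of the margins.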

We see from the form of (16) that the covariance of the
negative data is not involved in LAC and thus that if we set
$\mathbf{Q} = \begin{bmatrix} \mathbf{Q}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}$ then (26) becomes the optimization problem
of LAC.

There may be extremely (or even infinitely) many weak
classifiers in $\mathcal{H}$, the set from which $h(\cdot)$ is selected, meaning
that the dimension of the optimization variable $\mathbf{w}$ may
also be extremely large. So (26) is a semi-infinite quadratic
program (SIQP). We show how column generation can be
used to solve this problem. To make column generation
applicable, we need to derive a specific Lagrange dual of
the primal problem.



4.1 The Lagrange Dual Problem 

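We now derive the Lagrange dual of the quadratic problem
(26). Although we are only interested in the variable $\mathbf{w}$, we
need to keep the auxiliary variable $\boldsymbol{\rho}$ in order to obtain a
meaningful dual problem. The Lagrangian of (26) is

$$
L(\underbrace{\mathbf{w}, \boldsymbol{\rho}}_{\rm primal}, \underbrace{\mathbf{u}, r}_{\rm dual})
= \tfrac{1}{2} \boldsymbol{\rho}^\top \mathbf{Q} \boldsymbol{\rho} - \theta \mathbf{e}^\top \boldsymbol{\rho}
+ \mathbf{u}^\top (\boldsymbol{\rho} - \mathbf{A}\mathbf{w}) - \mathbf{q}^\top \mathbf{w}
+ r(\mathbf{1}^\top \mathbf{w} - 1),
$$

with $\mathbf{q} \succcurlyeq \mathbf{0}$. $\sup_{\mathbf{u}, r} \inf_{\mathbf{w}, \boldsymbol{\rho}} L(\mathbf{w}, \boldsymbol{\rho}, \mathbf{u}, r)$ gives the following
Lagrange dual:

$$
\max_{\mathbf{u}, r} \; -r - \underbrace{\tfrac{1}{2} (\mathbf{u} - \theta \mathbf{e})^\top \mathbf{Q}^{-1} (\mathbf{u} - \theta \mathbf{e})}_{\rm regularization}, \quad
\text{s.t. } \sum_{i=1}^{m} u_i \mathbf{A}_{i:} \preccurlyeq r \mathbf{1}^\top.
\tag{27}
$$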

In our case, $Q$ is rank-deficient and its inverse does not
exist (for both LDA and LAC). We can simply regularize
$Q$ with $Q + \delta I$, with $\delta$ a small positive constant. Actually,
$Q$ is a diagonally dominant matrix, but not strictly so.
So $Q + \delta I$ with any $\delta > 0$ is strictly diagonally
dominant and, by the Gershgorin circle theorem, a strictly
diagonally dominant matrix must be invertible.
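The two claims above can be checked on a toy instance. The sketch below is illustrative only (pure Python; `build_q`, `dominance_margins` and the sizes $m_1 = 4$, $m_2 = 3$ are our own assumptions): every row of $Q$ sums to zero, so the all-ones vector lies in its null space, and adding $\delta I$ makes every Gershgorin margin strictly positive.

```python
def build_q(m1, m2, lac=False):
    """Block-diagonal Q of problem (26); with lac=True the negative block is zero."""
    m = m1 + m2
    Q = [[0.0] * m for _ in range(m)]
    for start, size in ((0, m1), (m1, m2)):
        if lac and start == m1:
            continue  # LAC ignores the covariance of the negative data
        for i in range(size):
            for j in range(size):
                Q[start + i][start + j] = 1.0 / m if i == j else -1.0 / (m * (size - 1))
    return Q

def dominance_margins(Q):
    """Per-row |Q_ii| - sum_{j != i} |Q_ij|; all > 0 means strict diagonal dominance."""
    return [abs(row[i]) - sum(abs(v) for j, v in enumerate(row) if j != i)
            for i, row in enumerate(Q)]

Q = build_q(4, 3)
row_sums = [sum(row) for row in Q]  # equals Q times the all-ones vector

delta = 1e-3
Q_reg = [[v + (delta if i == j else 0.0) for j, v in enumerate(row)]
         for i, row in enumerate(Q)]
```

Here `row_sums` is numerically zero, which confirms that $Q$ is singular, while `dominance_margins(Q_reg)` is equal to $\delta$ in every row.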

One of the KKT optimality conditions between the dual
and primal is

$$ \rho^\star = -Q^{-1}(u^\star - \theta e), \tag{28} $$

which can be used to establish the connection between the
dual optimum and the primal optimum. This is obtained
from the fact that the gradient of $L$ w.r.t. $\rho$ must vanish at
the optimum, $\partial L / \partial \rho_i = 0$, $\forall i = 1 \cdots m$.
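For completeness, the step from the Lagrangian to the dual can be sketched as follows. This is our own reconstruction of the intermediate algebra, assuming $Q$ has been regularized so that $Q^{-1}$ exists; it is not quoted from the derivation itself.

```latex
\begin{aligned}
\nabla_{\rho} L = Q\rho - \theta e + u = 0
  &\;\Rightarrow\; \rho^\star = -Q^{-1}(u - \theta e), \\
\tfrac{1}{2}\rho^{\star\top} Q \rho^\star + (u - \theta e)^\top \rho^\star
  &= -\tfrac{1}{2}(u - \theta e)^\top Q^{-1}(u - \theta e), \\
\nabla_{w} L = -A^\top u - q + r\mathbf{1} = 0,\quad q \succeq 0
  &\;\Rightarrow\; A^\top u \preceq r\mathbf{1}, \\
\inf_{w,\rho} L
  &= -r - \tfrac{1}{2}(u - \theta e)^\top Q^{-1}(u - \theta e).
\end{aligned}
```

Maximizing the last line over $(u, r)$ subject to $A^\top u \preceq r\mathbf{1}$ gives the dual problem, and the first line is the KKT condition relating $\rho^\star$ and $u^\star$.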

Problem (27) can be viewed as a regularized LPBoost
problem. Compared with the hard-margin LPBoost [17],
the only difference is the regularization term in the cost
function. The duality gap between the primal (26) and the
dual (27) is zero. In other words, the solutions of (26) and
(27) coincide. Instead of solving (26) directly, one calculates
the most violated constraint in (27) iteratively for the current
solution and adds this constraint to the optimization
problem. In theory, any column that violates dual feasibility
can be added. To speed up the convergence, we add the
most violated constraint by solving the following problem:



$$ h'(\cdot) = \operatorname*{argmax}_{h(\cdot)} \sum_{i=1}^{m} u_i y_i h(x_i). \tag{29} $$



This is exactly the same as the one that standard AdaBoost
and LPBoost use for producing the best weak classifier,
that is, finding the weak classifier that has the minimum
weighted training error. We summarize the LACBoost/FisherBoost
algorithm in Algorithm 1. By simply
changing $Q_2$, Algorithm 1 can be used to train either
LACBoost or FisherBoost. Note that to obtain an actual
strong classifier, one may need to include an offset $b$, i.e.
the final classifier is $\sum_{j=1}^{n} w_j h_j(x) - b$, because from the cost
function of our algorithm (22), we can see that the cost
function itself does not minimize any classification error.
It only finds a projection direction in which the data can
be maximally separated. A simple line search can find an
optimal $b$. Moreover, when training a cascade, we need to
tune this offset anyway as shown in (20).
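To make the weak-learner step concrete, the following sketch searches decision stumps for the one with the largest edge $\sum_i u_i y_i h(x_i)$. It is a brute-force illustration under our own assumptions (real-valued features, stumps of the form $\pm\mathrm{sign}(x_f - t)$, and the hypothetical name `best_stump`), not the implementation used in the experiments.

```python
def best_stump(X, y, u):
    """Return (edge, feature, threshold, polarity) maximizing sum_i u_i y_i h(x_i).

    A stump outputs polarity * (+1 if x[feature] > threshold else -1).
    """
    best = (float("-inf"), None, None, None)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds: one below the minimum, then midpoints between values.
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in thresholds:
            edge = sum(ui * yi * (1 if row[f] > t else -1)
                       for row, yi, ui in zip(X, y, u))
            # Flipping the polarity of a stump negates its edge.
            for polarity, e in ((1, edge), (-1, -edge)):
                if e > best[0]:
                    best = (e, f, t, polarity)
    return best

# Toy data (hypothetical): feature 0 separates the two classes, feature 1 does not.
X = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
y = [-1, -1, 1, 1]
u = [0.25, 0.25, 0.25, 0.25]
edge, feature, threshold, polarity = best_stump(X, y, u)
```

For $h(x) \in \{-1, +1\}$ and nonnegative weights, the edge equals $\sum_i u_i$ minus twice the weighted error, so maximizing the edge and minimizing the weighted training error select the same stump.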

The convergence of Algorithm 1 is guaranteed by general
column generation or cutting-plane algorithms, which is
easy to establish. When a new $h'(\cdot)$ that violates dual
feasibility is added, the new optimal value of the dual
problem (maximization) would decrease. Accordingly, the
optimal value of its primal problem decreases too because
they have the same optimal value due to zero duality gap.
Moreover the primal cost function is convex, therefore in
the end it converges to the global minimum.

Algorithm 1 Column generation for SIQP.

Input: Labeled training data $(x_i, y_i)$, $i = 1 \cdots m$; termination
    threshold $\varepsilon > 0$; regularization parameter $\theta$;
    maximum number of iterations $n_{\max}$.

1 Initialization: $n = 0$; $w = 0$; and $u_i = 1/m$, $i = 1 \cdots m$.

2 for iteration = 1 : $n_{\max}$ do

3     - Check for the optimality:
        if iteration > 1 and $\sum_{i=1}^{m} u_i y_i h'(x_i) < r + \varepsilon$,
        then break; and the problem is solved;

4     - Add $h'(\cdot)$ to the restricted master problem, which
        corresponds to a new constraint in the dual;

5     - Solve the dual problem (27) (or the primal problem
        (26)) and update $r$ and $u_i$ ($i = 1 \cdots m$).

6     - Increment the number of weak classifiers $n = n + 1$.

Output: The selected features are $h_1, h_2, \ldots, h_n$. The final
    strong classifier is $F(x) = \sum_{j=1}^{n} w_j h_j(x) - b$. Here
    the offset $b$ can be learned by a simple search.

At each iteration of column generation, in theory, we
can solve either the dual (27) or the primal problem (26).
However, in practice, it could be much faster to solve the
primal problem because

1) Generally, the primal problem has a smaller size,
and is hence faster to solve. The number of variables of
(27) is $m$ at each iteration, while the number of
variables is the number of iterations for the primal
problem. For example, in Viola-Jones' face detection
framework, the number of training data $m = 10{,}000$
and $n_{\max} = 200$. In other words, the primal problem
has at most 200 variables in this case;

2) The dual problem is a standard QP problem. It has
no special structure to exploit. As we will show,
the primal problem belongs to a special class of
problems and can be efficiently solved using
entropic/exponentiated gradient descent (EG) [43], [44].
A fast QP solver is extremely important for training
an object detector because we need to solve a few
thousand QP problems.

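We can recover both of the dual variables $u^\star$, $r^\star$ easily
from the primal variable $w^\star$ (with $\rho^\star = A w^\star$):

$$ u^\star = -Q\rho^\star + \theta e; \tag{30} $$
$$ r^\star = \max_{j = 1 \cdots n} \Big\{ \sum_{i=1}^{m} u_i^\star A_{ij} \Big\}. \tag{31} $$

The second equation is obtained from the fact that in the dual
problem's constraints, at optimum, there must exist at least
one constraint for which the equality holds. That is to say, $r^\star$ is the
largest edge over all weak classifiers.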
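The recovery of the dual variables can be sketched in a few lines. The function below is an illustration under our own assumptions (dense list-of-lists matrices and the hypothetical name `recover_duals`); it computes $u = \theta e - Q\rho$ with $\rho = Aw$, which is a rearrangement of the KKT condition (28), and then takes $r$ as the largest edge over the columns of $A$.

```python
def recover_duals(A, Q, w, theta, e):
    """Recover (u, r) from a primal solution w.

    A is m x n (one column per selected weak classifier), Q is m x m,
    e is the length-m vector of (25), theta is the regularization parameter.
    """
    m, n = len(A), len(w)
    rho = [sum(A[i][j] * w[j] for j in range(n)) for i in range(m)]          # rho = A w
    u = [theta * e[i] - sum(Q[i][k] * rho[k] for k in range(m)) for i in range(m)]
    r = max(sum(u[i] * A[i][j] for i in range(m)) for j in range(n))         # largest edge
    return u, r

# Tiny hypothetical instance with Q = I to keep the arithmetic readable.
A = [[1, -1], [1, 1]]
Q = [[1.0, 0.0], [0.0, 1.0]]
u, r = recover_duals(A, Q, w=[0.5, 0.5], theta=1.0, e=[0.5, 0.5])
```

In this instance $\rho = [0, 1]$, so $u = [0.5, -0.5]$ and the two edges are $0$ and $-1$, giving $r = 0$.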

We give a brief introduction to the EG algorithm before
we proceed. Let us first define the unit simplex $\Delta_n = \{w \in
\mathbb{R}^n : \mathbf{1}^\top w = 1, w \succeq 0\}$. EG efficiently solves the convex
optimization problem

$$ \min_{w}\; f(w), \quad \text{s.t. } w \in \Delta_n, \tag{32} $$

under the assumption that the objective function $f(\cdot)$ is a
convex Lipschitz continuous function with Lipschitz constant
$L_f$ w.r.t. a fixed given norm $\|\cdot\|$. The mathematical
definition of $L_f$ is that $|f(w) - f(z)| \le L_f \|w - z\|$ holds
for any $w$, $z$ in the domain of $f(\cdot)$. The EG algorithm is very
simple:

1) Initialize with $w^0 \in$ the interior of $\Delta_n$;

2) Generate the sequence $\{w^k\}$, $k = 1, 2, \cdots$ with:

$$ w_j^k = \frac{w_j^{k-1} \exp[-\tau_k f_j'(w^{k-1})]}{\sum_{j=1}^{n} w_j^{k-1} \exp[-\tau_k f_j'(w^{k-1})]}. \tag{33} $$

Here $\tau_k$ is the step-size, and $f'(w) = [f_1'(w), \ldots, f_n'(w)]^\top$
is the gradient of $f(\cdot)$;

3) Stop if some stopping criteria are met.

The learning step-size can be determined by $\tau_k =
\frac{\sqrt{2 \log n}}{L_f} \frac{1}{\sqrt{k}}$, following [43]. In [44], the authors have used
a simpler strategy to set the learning rate.
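The three steps above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' C++ implementation: the function name `eg_minimize`, the fixed iteration budget used as the stopping criterion, and the toy objective $f(w) = \tfrac{1}{2}\|w - c\|_2^2$ are all our own assumptions.

```python
import math

def eg_minimize(grad, n, lipschitz, iters=1000):
    """Exponentiated gradient descent over the unit simplex.

    grad(w) returns the gradient f'(w) as a list of length n;
    lipschitz is the constant L_f used in the step-size rule.
    """
    w = [1.0 / n] * n  # step 1: start in the interior of the simplex
    for k in range(1, iters + 1):
        tau = math.sqrt(2.0 * math.log(n)) / (lipschitz * math.sqrt(k))
        g = grad(w)
        shift = min(g)  # subtracting a constant leaves the normalized update unchanged
        w = [wj * math.exp(-tau * (gj - shift)) for wj, gj in zip(w, g)]
        total = sum(w)
        w = [wj / total for wj in w]  # step 2: multiplicative update, then renormalize
    return w  # step 3: here we simply stop after a fixed number of iterations

# Toy problem: the minimizer of 0.5 * ||w - c||^2 over the simplex is c itself.
target = [0.7, 0.2, 0.1]
w_star = eg_minimize(lambda w: [wj - cj for wj, cj in zip(w, target)],
                     n=3, lipschitz=1.0)
```

The multiplicative form keeps every iterate strictly positive and on the simplex, so no explicit projection step is needed.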

EG is a very useful tool for solving large-scale convex 
minimization problems over the unit simplex. Compared 
with standard QP solvers like Mosek [45], EG is much 
faster. EG makes it possible to train a detector using almost 
the same amount of time as using standard AdaBoost as the 
majority of time is spent on weak classifier training and 
bootstrapping. 

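In the case that $m_1 \gg 1$,

$$ Q_1 = \frac{1}{m} \begin{bmatrix}
1 & -\frac{1}{m_1-1} & \cdots & -\frac{1}{m_1-1} \\
-\frac{1}{m_1-1} & 1 & \cdots & -\frac{1}{m_1-1} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{m_1-1} & -\frac{1}{m_1-1} & \cdots & 1
\end{bmatrix} \approx \frac{1}{m} I. $$

Similarly, for LDA, $Q_2 \approx \frac{1}{m} I$ when $m_2 \gg 1$. Hence,

$$ Q \approx \begin{cases}
\frac{1}{m} I, & \text{for Fisher LDA}, \\
\frac{1}{m} \begin{bmatrix} I & 0 \\ 0 & 0 \end{bmatrix}, & \text{for LAC}.
\end{cases} \tag{34} $$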



Therefore, the problems involved can be simplified when
$m_1 \gg 1$ and $m_2 \gg 1$ hold. The primal problem (26) equals

$$ \min_{w}\; \tfrac{1}{2} w^\top (A^\top Q A) w - (\theta e^\top A) w, \quad \text{s.t. } w \in \Delta_n. \tag{35} $$

We can efficiently solve (35) using the EG method. In
EG there is an important parameter $L_f$, which is used to
determine the step-size. $L_f$ can be determined by the $\ell_\infty$-norm
of $|f'(w)|$. In our case $f'(w)$ is a linear function,
which is trivial to compute. The convergence of EG is
guaranteed; see [43] for details.
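One way to make the computation of $L_f$ explicit is sketched below. The shorthand $H = A^\top Q A$ and $c = \theta A^\top e$ is ours, and the bound is our own working under the assumption that $L_f$ is taken w.r.t. the $\ell_1$ norm, whose dual is the $\ell_\infty$ norm.

```latex
\begin{aligned}
f(w) &= \tfrac{1}{2} w^\top H w - c^\top w, \qquad f'(w) = Hw - c, \\
\|f'(w)\|_\infty
  &= \Big\| \sum_{j=1}^{n} w_j \,(H_{:j} - c) \Big\|_\infty
  \;\le\; \max_{j = 1 \cdots n} \|H_{:j} - c\|_\infty
  \qquad \text{for all } w \in \Delta_n, \\
L_f &= \max_{j = 1 \cdots n} \|H_{:j} - c\|_\infty .
\end{aligned}
```

The second line uses $\sum_j w_j = 1$ and $w \succeq 0$, so the gradient is a convex combination of the vectors $H_{:j} - c$ and its norm is maximized at a vertex of the simplex. Evaluating the bound needs only one pass over the entries of $H$.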

In summary, when using EG to solve the primal problem,
Line 5 of Algorithm 1 is:

- Solve the primal problem (35) using EG, and update the
dual variables $u$ with (30), and $r$ with (31).



5 Experiments 

In this section, we first show an experiment on toy data
and then apply the proposed methods to face detection and
pedestrian detection.

5.1 Synthetic Testing

First, let us show a simple example on a synthetic dataset
(more negative data than positive data) to illustrate the
difference between FisherBoost and AdaBoost. Fig. 3
demonstrates the subtle difference of the classification boundaries
obtained by AdaBoost and FisherBoost when applied to
these data. We can see that FisherBoost places more emphasis
on correctly classifying positive data points than does
AdaBoost. This might be due to the fact that AdaBoost only
optimizes the overall classification accuracy. This finding is
consistent with the result in [11].

Fig. 3: Decision boundaries of AdaBoost (top) and FisherBoost (bottom)
on 2D artificial data (positive data represented by □'s and negative
data by ×'s). Weak classifiers are decision stumps. In this case,
FisherBoost intends to correctly classify more positive data.

5.2 Face Detection Using a Cascade Classifier

In this section, we compare FisherBoost and LACBoost with
the state-of-the-art in face detection.

We first show some results about the validity of LAC
(and Fisher LDA) post-processing for improving node
learning in object detection.

The algorithm for training a multi-exit cascade is shown
in Algorithm 2.

Algorithm 2 The procedure for training a multi-exit cascade
with LACBoost or FisherBoost.

Input:
    - A training set with $m$ examples, which are ordered by
      their labels ($m_1$ positive examples followed by $m_2$ negative
      examples);
    - $d_{\min}$: minimum acceptable detection rate per node;
    - $f_{\max}$: maximum acceptable false positive rate per node;
    - $F_{fp}$: target overall false positive rate.

1 Initialize:
    $t = 0$; (node index)
    $n = 0$; (total selected weak classifiers up to the current node)
    $D_t = 1$; $F_t = 1$. (overall detection rate and false positive rate
    up to the current node)

2 while $F_{fp} < F_t$ do

3     - $t = t + 1$; (increment node index)
      while $d_t < d_{\min}$ do
      (current detection rate $d_t$ is not acceptable yet)
          - $n = n + 1$, and generate a weak classifier and
            update all the weak classifiers' linear coefficients
            using LACBoost or FisherBoost.
          - Adjust threshold $b$ of the current boosted strong
            classifier
                $F^t(x) = \sum_{j=1}^{n} w_j^t h_j(x) - b$
            such that $f_t \approx f_{\max}$.
          - Update the detection rate of the current node $d_t$
            with the learned boosted classifier.
      - Update $D_{t+1} = D_t \times d_t$; $F_{t+1} = F_t \times f_t$
      - Remove correctly classified negative samples from the
        negative training set.
      - if $F_{fp} < F_t$ then
            Evaluate the current cascaded classifier on the
            negative images and add misclassified samples into the
            negative training set; (bootstrap)

Output: A multi-exit cascade classifier with $n$ weak classifiers
    and $t$ nodes.

As is described above, LAC and LDA assume that the
margins of training data associated with the node classifiers
in such a cascade exhibit a Gaussian distribution. In order
to evaluate the degree to which this is true for the face
detection task we show in Fig. 4 normal probability plots
of the margins of the positive training data for each of the
first three node classifiers in a multi-exit LAC cascade. The
figure shows that the larger the number of weak classifiers
used the more closely the margins follow a Gaussian
distribution. From this we infer that LAC and Fisher LDA
post-processing (and thus LACBoost and FisherBoost) can be
expected to achieve a better performance when a larger
number of weak classifiers are used. We therefore apply
LAC/LDA only within the later nodes (for example, 9
onwards) of a multi-exit cascade as these nodes contain
more weak classifiers. Because the late nodes of a multi-exit
cascade contain more weak classifiers than the standard
Viola-Jones' cascade, we conjecture that the multi-exit
cascade might meet the Gaussianity requirement better. We
have compared multi-exit cascades with LDA/LAC
post-processing against standard cascades with LDA/LAC
post-processing in [2] and slightly improved performances were
obtained.
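The threshold adjustment inside a node of Algorithm 2 can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names, the convention that a window is accepted when its score exceeds $b$, and the toy scores. The offset is set so that the node's false positive rate does not exceed $f_{\max}$, and the resulting detection rate $d_t$ is then compared with $d_{\min}$.

```python
def offset_for_false_positive_rate(neg_scores, f_max):
    """Smallest offset b such that the fraction of negatives with score > b is <= f_max."""
    ranked = sorted(neg_scores, reverse=True)
    allowed = int(f_max * len(ranked))  # number of negatives we may let through
    if allowed >= len(ranked):
        return ranked[-1] - 1.0         # every negative may pass
    return ranked[allowed]

def node_rates(pos_scores, neg_scores, b):
    """Detection rate and false positive rate of the node at offset b."""
    d = sum(s > b for s in pos_scores) / len(pos_scores)
    f = sum(s > b for s in neg_scores) / len(neg_scores)
    return d, f

# Hypothetical strong-classifier scores sum_j w_j h_j(x) before the offset is applied.
pos = [1.2, 0.8, 0.5, 0.05, -0.3]
neg = [0.9, 0.1, -0.2, 0.4, -0.5, 0.0, 0.3, -0.1, 0.6, -0.8]

b = offset_for_false_positive_rate(neg, f_max=0.5)
d_t, f_t = node_rates(pos, neg, b)
needs_more_weak_classifiers = d_t < 0.99  # inner-loop test with d_min = 0.99
```

With these toy scores the node passes half of the negatives but detects only 80% of the positives, so the inner loop of Algorithm 2 would add another weak classifier and repeat the adjustment.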

Six methods are evaluated with the multi-exit cascade
framework [9], which are AdaBoost with LAC post-processing
or LDA post-processing, AsymBoost with LAC
or LDA post-processing [2], and our FisherBoost and
LACBoost. We have also implemented Viola-Jones' face detector
as the baseline [6]. As in [6], five basic types of Haar-like
features are calculated, resulting in a 162,336 dimensional
over-complete feature set on an image of 24 × 24
pixels. To speed up the weak classifier training, as in [2],
we uniformly sample 10% of features for training weak
classifiers (decision stumps). The training data are 9,832
mirrored 24 × 24 face images (5,000 for training and 4,832
for validation) and 7,323 large background images, which
are the same as in [2]. The face images for training are
provided by Viola and Jones' work, the same as the face
training data used in [6].
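The feature count quoted above can be reproduced by direct enumeration. The sketch below assumes that the five basic types are the usual two-, three- and four-rectangle patterns with base shapes 2×1, 1×2, 3×1, 1×3 and 2×2 (the text does not list them, so this choice and the function name are our assumptions), and counts every scale and position that fits in the window.

```python
def count_haar_features(window=24,
                        base_shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count upright Haar-like features of the given base shapes in a square window."""
    total = 0
    for base_w, base_h in base_shapes:
        # Widths and heights are integer multiples of the base shape.
        for w in range(base_w, window + 1, base_w):
            for h in range(base_h, window + 1, base_h):
                # Number of top-left positions at which a w x h feature fits.
                total += (window - w + 1) * (window - h + 1)
    return total

n_features = count_haar_features()  # -> 162336
```

Sampling 10% of these features per boosting round, as described above, leaves roughly 16,000 candidate stumps to evaluate at each iteration.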

Multi-exit cascades with 22 exits and 2,923 weak classifiers
are trained with each of the methods listed above. In
order to ensure a fair comparison, we have used the same
cascade structure and the same number of weak classifiers for
all the compared learning methods. The indexes of exits are
pre-set to simplify the training procedure.

For our FisherBoost and LACBoost, we have an important
parameter $\theta$, which is chosen from a small set of
candidate values. We have not carefully tuned this parameter
using cross-validation. Instead, we train a 10-node
cascade for each candidate $\theta$, and choose the one with
the best training accuracy.^3 At each exit, negative examples
misclassified by the current cascade are discarded, and new
negative examples are bootstrapped from the background
images pool. In total, billions of negative examples are
extracted from the pool. The positive training data and
validation data remain unchanged during the training process.

Our experiments are performed on a workstation with
8 Intel Xeon E5520 CPUs and 32GB RAM. It takes about
3 hours to train the multi-exit cascade with AdaBoost or
AsymBoost. For FisherBoost and LACBoost, it takes less
than 4 hours to train a complete multi-exit cascade.^4 In
other words, our EG algorithm takes less than 1 hour to
solve the primal QP problem (we need to solve a QP
at each iteration). As an estimate of the computational
complexity, suppose that the number of training examples
is $m$ and the number of weak classifiers is $n$. At each iteration of
the cascade training, the complexity of solving the primal
QP using EG is $O(mn + kn^2)$, with $k$ the number of iterations needed
for EG's convergence. The complexity of training the weak
classifier is $O(md)$, with $d$ the number of all Haar-feature
patterns. In our experiment, $m = 10{,}000$, $n \approx 2900$,
$d = 160{,}000$, $k < 500$. So the majority of the computational
cost of the training process is bound up in the weak
classifier training.

3. To train a complete 22-node cascade and choose the best $\theta$ on
cross-validation data may give better detection rates.

4. Our implementation is in C++ and only the weak classifier
training part is parallelized using OpenMP.

We have also experimentally observed the speedup of
EG against standard QP solvers. We solve the primal QP
defined by (35) using EG and Mosek [45]. The QP's size is
1,000 variables. With the same accuracy tolerance (Mosek's
primal-dual gap is set to $10^{-7}$ and EG's convergence
tolerance is also set to $10^{-7}$), Mosek takes 1.22 seconds and
EG takes 0.0541 seconds on our standard desktop. So EG is
about 20 times faster. Moreover, at iteration $n + 1$ of training
the cascade, EG can take advantage of the last iteration's
solution by starting EG from a small perturbation of the
previous solution. Such a warm-start gains a 5 to 10×
speedup in our experiment, while no off-the-shelf
warm-start QP solver is available yet.

We evaluate the detection performance on the MIT+CMU
frontal face test set. This dataset is made up of 507 frontal
faces in 130 images with different backgrounds.

If one positive output has less than 50% variation of 
shift and scale from the ground-truth, we treat it as a true 
positive, otherwise a false positive. 

In the test phase, the scale factor of the scanning window 
is set to 1.2 and the stride step is set to 1 pixel. Two 
performance metrics are used here: one for each node and 
one for the entire cascade. The node metric is how well the 
classifiers meet the node learning objective, which provides 
useful information about the capability of each method to 
achieve the node learning goal. The cascade metric uses 
the receiver operating characteristic (ROC) to compare the 
entire cascade's performance. Note that multiple factors 
impact on the cascade's performance, however, including: 
the classifier set, the cascade structure, bootstrapping etc. 

Fig. 5 shows the false-negative rates for the various forms
of node classifiers when applied to the MIT+CMU face
data. The figure shows that FisherBoost and LACBoost
exhibit significantly better node classification performance
than the post-processing approach, which verifies the
advantage of selecting features on the basis of the node
learning goal. Note that the performance of FisherBoost and
LACBoost is very similar, but also that LDA or LAC
post-processing can considerably reduce the false negative rates
over the standard Viola-Jones' approach, which corresponds
with the findings in [2].

Fig. 5: Node performances on the validation data for face detection. "Ada" means that features are selected using AdaBoost; "Asym" means
that features are selected using AsymBoost.

The ROC curves in Fig. 6 demonstrate the superior
performance of FisherBoost and LACBoost in the face detection
task. Fig. 6 also shows that LACBoost does not outperform
FisherBoost in all cases, in contrast to the node performance
(detection rate) results. Many factors impact upon final
performance, however, and these sporadic results are not
seen as being particularly indicative. One possible cause is
that LAC assumes Gaussian and symmetric
data distributions, which may not hold well in the
early nodes. Wu et al. have observed the same phenomenon:
LAC post-processing does not outperform LDA
post-processing in some cases.

The error reduction results of FisherBoost and LACBoost
in Fig. 6 are not as great as those in Fig. 5. This might
be explained by the fact that the cascade and negative
data bootstrapping are compensating for the inferior node
classifier performance to some extent.

We have also compared our methods with the boosted
greedy sparse LDA (BGSLDA) in [11], [46], which is
considered one of the state-of-the-art. FisherBoost and LACBoost
outperform BGSLDA with AdaBoost/AsymBoost in the
detection rate. Note that BGSLDA uses the standard cascade.

5.3 Pedestrian Detection Using a Cascade Classifier 

In this experiment, we use the INRIA pedestrian data [29]
to compare the performance of our algorithms with other
state-of-the-art methods. There are 2,416 cropped mirrored
pedestrian images and 1,200 large background images in
the training set. The test set contains 288 images containing
588 annotated pedestrians and 453 non-pedestrian images.

Each training sample is scaled to 64 × 128 pixels with 16
pixels of additional border for preserving the contour
information. During testing, the detection scanning window is
resized to 32 × 96 pixels to fit the human body. We have
used the histogram of oriented gradient (HOG) features
in our experiments. Instead of using fixed-size blocks (105
blocks of size 16 × 16 pixels) as in Dalal and Triggs [29],
we define blocks with different scales (minimum 12 × 12,
and maximum 64 × 128) and width-length ratios (1 : 1, 1 : 2,
2 : 1, 1 : 3, and 3 : 1). Each block is divided into 2 × 2
cells, and the HOG in each cell is summarized into 9
bins. Thus, a 36-dimensional feature vector is generated
for each block. There are in total 7,735 blocks for a 64 × 128-pixel
image. $\ell_1$-norm normalization is then applied to the
feature vector. Furthermore, we use integral histograms to
speed up the computation as in [47]. At each iteration, we
randomly sample 10% of all the possible blocks for
training a weak classifier. We have used weighted linear
discriminant analysis (WLDA) as weak classifiers, the same
as in [13]. Zhu et al. used linear support vector machines
as weak classifiers [47], which can also be used as weak
classifiers here.

For all the approaches evaluated, we use the same
cascade structure with 21 nodes and 612 weak
classifiers in total (the first three nodes have four weak classifiers
each, and the last six have 60 weak classifiers each).

The positive examples are from the INRIA training set
and remain the same for each node. The negative examples
are obtained by collecting the false positives of the currently
learned cascade from the large background images with
bootstrapping. The parameter $\theta$ of our FisherBoost and
LACBoost is selected from a small set of candidate values. We have
not carefully selected $\theta$ in this experiment. Ideally,
cross-validation should be used to pick the best value of $\theta$ by
using an independent cross-validation data set. Because the
INRIA data set does not have many labeled positive data,
we have used the same 2,416 training positives, plus 500
additional negative examples obtained by bootstrapping, for
validation. Improvement might be obtained if a large
cross-validation data set was available.

The scale ratio of the input image pyramid is 1.09 and the
scanning step-size is 8 pixels. The overlapped detection
windows are merged using the simple heuristic strategy
proposed by Viola and Jones [6]. It takes about 5 hours to
train the entire cascade pedestrian detector on the workstation.

For the same reason described in the face detection section,
FisherBoost/LACBoost and Wu et al.'s LDA/LAC
post-processing are applied to the cascade from about the
3rd node, instead of the first node.

Since the number of weak classifiers of our pedestrian 
detector is small, we use the original matrix Q rather than 
the approximate diagonal matrix in this experiment. 

Fig. 6: Cascade performances using ROC curves (number of false positives versus detection rate) on the MIT+CMU test data. "Ada" means that
features are selected using AdaBoost. Viola-Jones cascade is the method in [6]. "Asym" means that features are selected using AsymBoost.

Fig. 7: Node performances on the validation data (INRIA pedestrian detection). "Ada" means that features are selected using AdaBoost; "Asym"
means that features are selected using AsymBoost.

The Pascal VOC detection Challenge criterion [13], [48]
is adopted here. A detection result is considered true or
false positive based on the area of overlap with the ground
truth bounding box. To be considered a correct detection,
the area of overlap between the predicted bounding box
and ground truth bounding box must exceed 40% of the
union of the prediction and the ground truth. We use the
false positives per image (FPPI) metric as suggested in [49].
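The overlap test can be written compactly. The sketch below assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates (a representation and function names of our choosing) and uses the 40% threshold stated above.

```python
def overlap_ratio(a, b):
    """Area of intersection divided by area of union for boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, truth, threshold=0.4):
    """A detection is correct if its overlap with the ground truth exceeds the threshold."""
    return overlap_ratio(pred, truth) > threshold

truth = (0, 0, 10, 10)
shifted_by_half = (5, 0, 15, 10)   # intersection 50, union 150
shifted_slightly = (2, 0, 12, 10)  # intersection 80, union 120
```

A window shifted by half its width overlaps by only one third and is counted as a false positive, whereas a shift of one fifth still gives an overlap of two thirds.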

Fig. 7 shows the node performances of various
configurations on the INRIA pedestrian data set. Similar results are
obtained as in the face detection experiment: again, our new
boosting algorithms significantly outperform AdaBoost [6] and
AsymBoost [7], and are considerably better than Wu et al.'s
post-processing methods [2] at most nodes. Compared with
the face detection experiment, we obtain a more obvious
improvement on detecting pedestrians. This may be because
pedestrian detection is much more difficult and there is
more room for improving the detection performance.

We have also compared the ROC curves of complete
cascades, which are plotted in Fig. 8. FisherBoost and
LACBoost perform better than all other compared methods.
In contrast to the results of the detection rate for each
node, LACBoost is slightly worse than FisherBoost in some
cases (also see Fig. 9). In general, LAC and LDA
post-processing improve on the methods without post-processing. We
can also see that LAC post-processing performs slightly worse
than other methods in the low false positive region. Probably
LAC post-processing over-fits the training data in this
case. Also, under the same conditions, FisherBoost/LDA
post-processing seems to perform better than LACBoost/LAC
post-processing. We will discuss this issue in the next
section.

Fig. 8: Cascade performances in ROC (false positives per image versus detection rate) on the INRIA pedestrian data. "Ada" means that features
are selected using AdaBoost. Viola-Jones cascade is the method in [6] with weighted LDA on HOG as weak classifiers. "Asym" means that
features are selected using AsymBoost.

Fig. 9: Comparison between FisherBoost, LACBoost, HOG with the
linear SVM of Dalal and Triggs [29], and Pyramid HOG with the
histogram intersection kernel SVM (IKSVM) of Maji et al. [18]. In
detection rate, our FisherBoost improves Dalal and Triggs' approach
by over 7% at FPPI of 0.3 on INRIA's pedestrian detection dataset.

In summary, FisherBoost and LACBoost perform better
than all the other algorithms. We have also compared
FisherBoost and LACBoost with HOG with linear
SVM of Dalal and Triggs [29], and the state-of-the-art
on pedestrian detection, the pyramid HOG (PHOG) with
histogram intersection kernel SVM (IKSVM) [18]. We keep
all experiment configurations the same, except that HOG
with linear SVM and PHOG with IKSVM have employed
sophisticated mean shift to merge overlapped detection
windows, while ours use the simple heuristic of Viola
and Jones [6]. The results are reported in Fig. 9 and the
observations are:

1) LACBoost performs similarly to Dalal and Triggs' [29];

2) PHOG with nonlinear IKSVM performs much better
than Dalal and Triggs' HOG with linear SVM. This is
consistent with the results reported in [18];

3) FisherBoost performs better than PHOG with IKSVM
in the low FPPI region (lower than 0.5).

Note that our FisherBoost and LACBoost use HOG, instead
of PHOG. It is not clear how much gain PHOG has
contributed to the final detection performance in the case of
PHOG plus IKSVM of [18].^5

In terms of efficiency in the test phase for each method,
our FisherBoost or LACBoost needs about 0.7 seconds on
average on the INRIA test data (no image re-scaling is
applied and a single CPU core is used on our workstation).
PHOG with IKSVM needs about 8.3 seconds on average.
So our boosting framework is about 14 times faster than
PHOG with IKSVM. Note that HOG with linear SVM is
much slower (50 to 70 times slower than the boosting
framework), which agrees with the results in [47].

5. On object categorization, PHOG seems to be a better descriptor
than HOG [50]. It is likely that our detectors may perform better if we
replace HOG with PHOG. We leave this as future work.

In both face and pedestrian detection experiments, we
have observed that FisherBoost performs slightly better
than LACBoost. We elaborate on this in the following.

5.4 Why LDA Works Better Than LAC 

Wu et al. observed that in many cases, LDA post-processing
gives better detection rates on MIT+CMU face data than
LAC [2]. When using the LDA criterion to select Haar
features, Shen et al. [46] tried different combinations of the two
classes' covariance matrices for calculating the within-class
matrix: $C_w = \Sigma_1 + \gamma \Sigma_2$ with $\gamma$ a nonnegative constant. It
is easy to see that $\gamma = 1$ and $\gamma = 0$ correspond to LDA and
LAC, respectively. They found that setting $\gamma \in [0.5, 1]$ gives
the best results on the MIT+CMU face detection task [11], [46].

According to the analysis in this work, LAC is optimal if
the distribution of $[h_1(x), h_2(x), \cdots, h_n(x)]$ on the negative
data is symmetric. In practice, this requirement may not
be perfectly satisfied, especially for the first several node
classifiers. At the same time, these early nodes have much
more impact on the final cascade's detection performance
than other nodes. This may explain why in some cases the
improvement of LAC is not significant. However, this does
not explain why LDA (FisherBoost) works, and why it is
sometimes even better than LAC (LACBoost). At first glance,
LDA (or FisherBoost) by no means explicitly considers the
imbalanced node learning objective. Wu et al. did not have
a plausible explanation either [1], [2].

Proposition 5.1. For object detection problems, the Fisher linear 
discriminant analysis can be viewed as a regularized version of 
linear asymmetric classification. In other words, linear discrim- 
inant analysis has already considered the asymmetric learning 
objective. 

Proof. For object detection such as face and pedestrian
detection considered here, the covariance matrix of the
negative class is close to a scaled identity matrix. In theory,
the negative data can be anything other than the target. Let
us look at one of the off-diagonal elements

$$ \Sigma_{ij,\, i \neq j} = \mathrm{E}\big[(h_i(x) - \mathrm{E}[h_i(x)])(h_j(x) - \mathrm{E}[h_j(x)])\big] = \mathrm{E}[h_i(x) h_j(x)] \approx 0. \tag{36} $$

Here $x$ is the image feature of the negative class. We can
assume that $x$ is i.i.d. and that, approximately, $x$ follows a
uniform distribution. So $\mathrm{E}[h_{i,j}(x)] = 0$. That is to say, on the
negative class, the chance of $h_{i,j}(x) = +1$ or $h_{i,j}(x) = -1$
is the same, which is 50%. Note that this does not apply
to the positive class because $x$ of the positive class is not
uniformly distributed. The last equality of (36) uses the
fact that weak classifiers $h_i(\cdot)$ and $h_j(\cdot)$ are approximately
statistically independent. Although this assumption may
not hold in practice as pointed out in [41], it could be a
plausible approximation.

Therefore, the off-diagonal elements of $\Sigma$ are almost
all zeros, and $\Sigma$ is approximately a diagonal matrix. Moreover, in object
detection, it is a reasonable assumption that the diagonal
elements $\mathrm{E}[h_j(x) h_j(x)]$ ($j = 1, 2, \cdots$) have similar values.
Hence, $\Sigma_2 \approx vI$ holds, with $v$ a positive constant.
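The argument can be illustrated with a small simulation. The sketch below is purely synthetic and rests on the idealized assumption that the weak classifier outputs on negative data are independent and equally likely to be $+1$ or $-1$ (real weak classifiers are only approximately so); the variable names and sample sizes are our own choices.

```python
import random

def sample_covariance(columns):
    """Unbiased sample covariance matrix; columns[i] holds the outputs of classifier i."""
    n, m = len(columns), len(columns[0])
    means = [sum(c) / m for c in columns]
    return [[sum((columns[i][k] - means[i]) * (columns[j][k] - means[j])
                 for k in range(m)) / (m - 1)
             for j in range(n)] for i in range(n)]

rng = random.Random(0)  # fixed seed so that the run is repeatable
n_weak, n_samples = 5, 4000
# Each simulated weak classifier outputs +1 or -1 with probability 1/2 on a negative sample.
outputs = [[1 if rng.random() < 0.5 else -1 for _ in range(n_samples)]
           for _ in range(n_weak)]

cov = sample_covariance(outputs)
diagonal = [cov[i][i] for i in range(n_weak)]
off_diagonal = [abs(cov[i][j]) for i in range(n_weak) for j in range(n_weak) if i != j]
```

Under these assumptions the diagonal entries are close to 1 and the off-diagonal entries are of order $1/\sqrt{4000} \approx 0.016$, so the estimated covariance is close to a scaled identity with $v \approx 1$.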




Fig. 10: The covariance matrix of the first 112 weak classifiers selected
by FisherBoost on non-pedestrian data (plot title: covariance of weak
classifiers on non-pedestrian data). It may be approximated by a
scaled identity matrix. On average, the magnitude of the diagonal elements
is 20 times larger than that of the off-diagonal elements.



So for object detection, the only difference between LAC
and LDA is that, for LAC $C_w = \frac{m_1}{m}\Sigma_1$ and for LDA $C_w =
\frac{m_1}{m}\Sigma_1 + v \cdot \frac{m_2}{m} I$. This concludes the proof. □

It seems that this regularization term can be the reason
why the LDA post-processing approach and FisherBoost
work even better than LAC and LACBoost in object
detection. However, in practice, the negative data are not
necessarily uniformly distributed. In particular, in later nodes,
bootstrapping leaves only the difficult negative examples.
In this case, completely ignoring the negative data's
covariance information may degrade the performance.

In FisherBoost, this regularization is equivalent to having
an $\ell_2$-norm regularization on the primal variable $w$, $\|w\|_2^2$,
in the objective function of the QP problem in Section 4.
Machine learning algorithms like ridge regression use $\ell_2$-norm
regularization.
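The equivalence can be written out in one line. The following is our own expansion, assuming $\Sigma_2 \approx vI$ as argued in the proof and using $C_w = \frac{m_1}{m}\Sigma_1 + \frac{m_2}{m}\Sigma_2$:

```latex
\begin{aligned}
w^\top C_w^{\mathrm{LDA}}\, w
  &= \tfrac{m_1}{m}\, w^\top \Sigma_1 w + \tfrac{m_2}{m}\, w^\top \Sigma_2 w \\
  &\approx \underbrace{\tfrac{m_1}{m}\, w^\top \Sigma_1 w}_{\text{LAC term}}
   \;+\; \underbrace{v\,\tfrac{m_2}{m}\, \|w\|_2^2}_{\ell_2 \text{ regularization}} .
\end{aligned}
```

On the unit simplex, $\|w\|_2^2$ is smallest for uniform weights, so the extra term discourages concentrating the weight on a few weak classifiers.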



Fig. 10 shows some empirical evidence that $\Sigma_2$ is close
to a scaled identity matrix. As we can see, the diagonal
elements are much larger than the off-diagonal elements
(the off-diagonal ones are close to zero).



6 Conclusion



By explicitly taking into account the node learning goal
in cascade classifiers, we have designed new boosting
algorithms for more effective object detection. Experiments
validate the superiority of the methods developed, which
we have labeled FisherBoost and LACBoost. We have also
proposed the use of entropic gradient descent to efficiently
implement FisherBoost and LACBoost. The proposed
algorithms are easy to implement and can be applied to other
asymmetric classification tasks in computer vision. In future
work, we aim to design new asymmetric boosting algorithms by
exploiting asymmetric kernel classification methods such
as [51]. Compared with stage-wise AdaBoost, which is
parameter-free, our boosting algorithms need to tune a
parameter. We are also interested in developing parameter-free
stage-wise boosting that considers the node learning
objective.